Introduction to OpenCL¶

OpenCL (short for Open Computing Language) is an open standard for running compute workloads on many different kinds of compute hardware (e.g CPUs, GPU's, and FPGA's). The OpenCL trademark is held by Apple, and the standard is developed and released by the Khronos group, a non-for-profit organisation that provides a focal point for the development of royalty-free standards such as OpenGL and Vulkan. The OpenCL specification itself is just a document, and can be downloaded from the Khronos website here. It is then the task of compute hardware vendors to produce software implementations of OpenCL that best make use of their compute devices.

GPU's for scientific computing?¶

Graphics processing units (GPU's) were originally designed to perform the complex calculations for computing the values for pixels in graphical applications such as 3D rendering. As this process is readily parallelisable, the rendering calculations were offloaded to specialised hardware pipelines to be performed in parallel. Eventually this specialised hardware became generalised and programmable, and GPU's became capable of other tasks like performing scientific calculations. Commercial pressure to achieve the best frame rates in games led to GPU designs that incorporate high bandwidth memory and the ability to parallelise calculations over thousands of discrete processing elements. These days GPU's have floating point performance and memory bandwidth that exceeds CPU's by as much as an order of magnitude. Below is a table of the estimated capacity of the compute devices on Setonix.

Compute device Peak FP32 processing power (TFLOP/s)
AMD EPYC 7763 1.8
AMD Radeon Instinct MI250X 2 x 23.95

As with CUDA and HIP, OpenCL is a software framework that provides a way to harness the compute power of modern GPU's. However, it also has the ability to use CPU's and FPGA's as compute devices.

OpenCL features from a distance¶

OpenCL is a mature specification that is supported and developed by the Khronos Group which is a member-led consortium of more than 150 companies. Every major computing vendor, such as AMD, NVIDIA, Intel, Xilinx and ARM has an implementation of OpenCL that follows the specification. This means that a code that is developed for one compute device has the potential to run on another compute device, even if it is from another vendor, without significant modification. In supercomputing there is a welcome change in diversity among available hardware options. OpenCL enables users to navigate this diversity without incurring significant costs when moving platforms. For some compute devices OpenCL is the only computing framework available.

While the ability to switch hardware vendors is of significant value to end users, it can be of lesser importance to some hardware vendors, who by nature prefer to devote resources to ensuring the success of their proprietary frameworks. Thus progress on OpenCL implementations tend to lag behind, both in performance and capability, when compared to frameworks like CUDA and HIP. Therefore with OpenCL on some platforms, one can expect to encounter poor to non-existent support from debugging and performance monitoring tools, as well as performance that approaches but seldom exceeds what can be achieved with proprietary frameworks.

OpenCL is a mature standard, and it has first-rate support for consistent math across compute devices, as well as support for large vector types up to 16 elements long. Shared memory support in OpenCL is first rate, and is well placed to make use of System-On-Chip architectures such as the AMD Instinct MI300.

Is OpenCL right for your project?¶

This is sometimes a difficult question to answer. Researchers often have diverse computing needs, in such cases OpenCL is a good fit as it will provide a solid and flexible foundation for your research tools. If you are looking for the best possible performance and vendor lock-in will not be an issue, then it is probably better to look at using vendor-specific tools and frameworks. However if you prefer open standards, and you need consistent results across devices, stability, support for large vector types, and great documentation that comes with a mature implementation, then OpenCL can be a solid foundation for your project. Here are some challenges and benefits in using OpenCL:

Challenges with using OpenCL

  • Performance with a vendor's compute device often lags their proprietary framework in performance.
  • OpenCL code is often verbose and has tedious amounts of boilerplate with increased potential for errors.
  • The developer has to manage kernels and compile kernels explicitly.
  • Little or no support from the vendor for debugging and performance measurement.
  • Challenges for hiding intellectual property contained within kernels.
  • Performance differences between compute devices. A kernel that works on a GPU may not work well on a CPU and vice versa.
  • Undefined behaviour across OpenCL implementations from different vendors.
  • Library support is sparse. Less chance of a ready solution being available.
  • Have to separately call functions in vendor libraries for device-specific hardware (i.e tensor or matrix cores).
  • Buffer allocations are sometimes limited to $1/4$ of available device memory (vendor specific).
  • Literature on using OpenCL is sparse.

Benefits from using OpenCL

  • Straightforward well-defined C API with great documentation.
  • Long-lived and mature standard.
  • Mature, production-quality OpenCL implementations.
  • Open standard - the standard is not (explicitly) contingent on the wellbeing of a single vendor.
  • Ability to use a wide variety of compute devices such as FPGA's and CPU's.
  • Excellent support for:
    • Standard data types
    • Long vector data types such as float16 with first rate "swizzling" operators that can perform operations such as reordering components.
    • Consistent math across devices using the IEEE754 standard.

How does OpenCL work?¶

Kernels in software threads¶

An OpenCL implementation is a framework to support running lightweight pieces of code, called kernels, in parallel over the available cores of a compute device. Below is an example kernel to compute the floating point absolute value of a single element in an array of floating point numbers.

__kernel void vec_fabs(
        // Memory allocations that are on the compute device
        __global float *src, 
        __global float *dst,
        // Number of elements in the memory allocations
        int length) {

    // Get our position in the array
    size_t gid0 = get_global_id(0);

    // Get the absolute value of 
    if (gid0 < length) {
        dst[gid0] = fabs(src[gid0]);
    }
}

In order to take the absolute value of every element we need to run this kernel at every point in the array. A software thread can be thought of as the execution of a sequence of compute instructions independently from other threads. In that sense a kernel is run in a software thread.

Hardware threads¶

A hardware thread is a pipeline of physical machinery that executes the instructions in a software thread. Compute devices have a number of cores to manage memory and execute software threads. In AMD and OpenCL terminology these cores are called Compute Units. Every compute unit makes available to the OS a number of hardware threads for running kernels. In OpenCL terminology a processing element is synonymous with a hardware thread.

GPU specifics¶

GPU's use a SIMT (Single Instruction Multiple Threads) processing model, where instructions are executed by the Compute Unit over teams of hardware threads that operate in lock-step with each other and in parallel. For AMD GPUs the team is 32-64 threads wide and is called a wavefront. For NVIDIA GPUs a team is usually 32 threads wide and is called a warp. Each hardware thread in a GPU has access to specialised floating point and integer units for perfoming math operations. These specialised units are known as shader cores (AMD) or CUDA cores (NVIDIA). There are many thousands of these units in a GPU and this feature, along with greater memory bandwidth is responsible for the significant performance advantage that GPU's have over CPU's.

The example below a graphical layout of an AMD MI250X GPU processor. Each processor contains two GPU dies; each die contains 8 shader engines; and each shader engine contains ~14 compute units, for a total of 110 Compute Units per die. Every compute unit commands a wavefront of 64 hardware threads, therefore on this processor there are two unique compute devices, each with $110\times64 = 7040$ available hardware threads for use in compute applications.

AMD Instinct™ MI250X compute architecture. Image credit: AMD Instinct™ MI200 Series Accelerator and Node Architectures | Hot Chips 34

CPU specifics¶

In CPUs each compute unit also makes available to the OS a number of hardware threads - usually 2-4. These threads are more "independent" than their GPU equivalents and are not constrained to operate in lock-step with each other. CPU hardware threads have access to SIMD vector units to perform vector math operations, however this hardware is only accessed through special vector instructions that the compiler conservatively generates if it deems it is safe to do so. OpenCL has vector data types whose math operations map easily to SIMD vector instructions, thus making it easier to build software that uses SIMD vector units in CPUs.

Workgroups and work-items as part of the Grid¶

An OpenCL implementation is a way to run kernel instances on processing elements (hardware threads) as they become available. The implementation also provides the means to upload and download memory to and from compute devices. We specify how many kernel instances we want at runtime by defining a 3D execution space called a Grid and setting its size at kernel launch. Every point in the Grid is called a work-item, and represents a unique invocation of the kernel. A work-item is equivalent to a single kernel invocation. This is much like defining an execution space using nested loops, however with OpenCL there are no guarantees on the order in which work items are completed.

Three-dimensional Grid with work-items and work-groups.

Work-items are executed in teams called work-groups. In the example above, the grid is of global size (10, 8, 2) and each work-group is of size (5,4,1). The the number of work-groups in each dimension is then (2,2,2). Every work-item has access to device memory that it can use exclusively (private memory), access to memory the team can use (local memory), and access to memory that other teams use (global and constant memory). Every kernel invocation or work-item can query its location within the Grid and use that position as a reference to access allocated memory on the compute device at an appropriately calculated offset.

Using the location within the Grid to access memory within a memory allocation on a GPU compute device.

The above concepts form the core ideas surrounding OpenCL. Everything that follows in this course is supporting information on how to prepare compute devices, manage memory, invoke kernels, and how best to use these concepts together to get the best performance out of your compute devices.

Elements of an accelerated application¶

In every accelerated application there is the concept of a host computer on which there are one or more compute devices. The host usually has the largest memory space available and the compute device usually has the most compute power and memory bandwidth. This is why we say the application is "accelerated" by the compute device.

At runtime, the host executes the application. During execution kernels are compiled for available compute devices. The host program manages memory allocations on the compute device and executes compiled kernels on the compute device. For instances where the compute device is a CPU the host CPU and the compute device are the same thing.

Accelerated applications follow the same logical progression of steps:

  1. Compute resources discovered
  2. Kernels compiled for compute device/s
  3. Memory allocated on compute device/s
  4. Memory is copied from the host to the compute device/s
  5. Kernels run on the compute device/s
  6. The host waits for kernels to finish
  7. Memory is copied back from the compute device/s to the host
  8. Repeat steps 3 - 8 as many times as necessary
  9. Clean up resources and exit

We now discuss the OpenCL components that make these steps possible.

Taxonomy of an OpenCL application¶

Below is a representation of the core software components that are available to an OpenCL application.

Components of an OpenCL application.

The first is the Platform. This is a software representation of a vendors implementation. A platform provides access to all devices that the platform supports. During device discovery, available platforms are queried first. A platform provides access to one or more compute devices and possibly even a mixture of accelerator devices from the same vendor.

A Device provides a way to query the capabilites of the compute device and provides a foundation to build a Context.

Surrounding the devices is a Context. A Context is a resource manager that keeps track of everything (i.e kernel executions and memory allocations) that are happening on the compute device/s. A Context is constructed on using both a platform and one or more devices on the platform. There are some benefits (such as memory copies) that could be obtained by encapsulating one or more devices under the same context, however this assumes that devices must belong to the same platform - an assumption which may not be true. A simpler and more general design is to create a unique context for every compute device.

Within the control of the Context are Buffers. Buffers are memory allocations managed under the context, and may exist on either the host or the compute device. At runtime memory is migrated to where is needed, but you can have some control over where a buffer "lives".

At runtime, source code for the kernels are collated into Programs. This is repeated for every utilised context. In a subsequent step programs are built for every utilised compute device in a context

Once a context has been created and devices are known, then one can create one or more Command queue/s for each device. A command queue is a place to submit work, such as kernel invocations and memory copies. When multiple command queues are used then it can leverage the asynchronous capabilities of the compute device, such as being able to copy during compute.

A Kernel is a function that is part of a compiled Program. At runtime we set the arguments of a compile kernel and then submit the kernel to a command queue for execution. We can keep track of the status of a command submitted to the command queue using an Event.

In summary we have the following components:

  • Platform: provides access to devices
  • Device: represents a way to access the compute device and to query device capabilities
  • Context: provides a way to create buffers and keep track of what is happening on compute devices
  • Buffer: provides a way to allocate memory on devices
  • Program: provides a way to aggregate kernels for each context and then build those kernels for each compute device in the context
  • Command queue: provides a place to send work such as memory copy commands and kernel executions
  • Kernel: provides a way to do work on a compute device
  • Event: provides a way to keep track of work submitted to a command queue

Specification Roadmap¶

From Wikipedia OpenCL was originally designed by Apple, who developed a proposal to submit to the Khronos group and holds the trademarks to OpenCL. The first specification, OpenCL 1.0, was ratified on November 18, 2008 and the first public release of the standard was on December 2008. Since then a number of different versions of the standard have been released.

Version 1.1 introduced thread safety so that calls to most OpenCL functions from different threads didn't introduce race conditions. If memory allocations in buffers are used to represent 2D and 3D arrays, then Version 1.1 introduced routines to copy rectangular regions of those buffers to and from the host.

Version 1.2 is probably the most significant release of OpenCL. It remained the defacto OpenCL standard for at least 10 years. Abilities such as being able to divide the processing elements of a compute device into sub-devices that share a common cache and offline compilation of kernels were useful. The introduction of math operations that conform to the IEEE754 precision standard meant consistent results across heterogeneous compute architectures.

Version 2.0 introduced support for Shared Virtual Memory (SVM). Implementation of SVM meant we no longer needed to qualify which space (i.e global, local..) a memory allocation belonged to, and memory could be transferred to and from devices transparently to the user. This was too much for some vendors to implement however, and a few vendor implementations remained at 1.2 for a number of years.

Version 2.1 brought the SPIR-V (Standard Portable Intermediate Representation) language to OpenCL. During compilation a open-source compiler can take C or C++ kernel code and emit a compiled program as SPIR-V intermediate code. At runtime this program is loaded by the application and passed to the vendor driver for further compilation to binary code that can run on the compute device. This is a significant advancement, because if a vendor can implement support for SPIR-V then it dramatically reduces the number of intermediate representations the vendor compiler must support. It also offloads support for kernel language advancements to the open source compiler and provides a measure of security against intellectual property theft.

Version 2.2 allowed kernels to be produced using a subset of the C++14 standard. It also updated support for SPIR-V to version 1.2. The combination of shared virtual memory, C++ kernels, and SPIR-V support meant that very few vendors actually succeeded in producing viable implementations of OpenCL 2.2, and OpenCL stagnated for a period of 5 years.

Version 3.0 addressed the issue of stagnation by making Version 1.2 standard and all the other improvements in Version 2.x optional. This gave vendors freedom to implement what they wanted for customers and gave the standard some breathing room. Version 3.0 also introduced a new C++ language for kernels (called C++ for OpenCL) that uses a subset of the C++17 standard. The Clang compiler supports compilation of C++ for OpenCL kernels into SPIR-V format.

Below is a summary of major features implemented with each release:

Specification Release year Specifics
1.0 2008 Initial implementation
1.1 2010
  • Copy rectangular sections to and from Buffers
  • User-defined Events
  • 3-component vector types
  • Support for making Buffers from Buffers (sub-Buffers)
  • Thread safety for all functions except setting kernel arguments
1.2 2011
  • The ability to divide a compute device into sub-devices, that share a cache for example
  • Offline compilation of kernels
  • Support for built-in kernels (i.e for FPGA's)
  • IEEE754 compliance for consistent math across devices
  • Enabling double precision math
  • Support for using printf in kernels for debugging
2.0 2013
  • Support for Shared Virtual Memory (SVM)
  • The ability to run kernels from kernels
  • Enhanced support for Images (specialised Buffers)
  • Simplified atomics
  • Pipe storage
  • Double precision IEEE754 operations
2.1 2015
  • Low latency timers for profiling events
  • The ability to make some command queues have higher priority than others
  • Introduces the SPIR-V 1.1 intermediate language for compiled kernels
  • The ability to clone kernels
2.2 2015
  • A subset of C++14 for Kernels
  • The ability to make some command queues have higher priority than others
  • Updates the SPIR-V intermediate language to version 1.2
3.0 2020
  • Version 1.2 is canon, everything else is optional
  • The ability to call a function when a context is destroyed
  • Introduces C++ for OpenCL

Vendor implementations¶

All of the major vendors have OpenCL implementations at varying levels of support for the OpenCL specification. The table below shows the latest known level of support for each version of the specification, along with links to the vendor's OpenCL developer page.

Vendor 1.2 2.0 2.1 2.2 3.0
AMD Y Y Y Some N
Apple Y N N N N
ARM Y Y Y N Y
Intel Y Y Y Some Y
NVIDIA Y N N N Y
Portable OpenCL Y Some N N Y

Apple was the original vendor for OpenCL and it comes baked into the MacOS operating system. However the company has since moved on to their proprietary framework Metal and they haven't invested in OpenCL beyond specification 1.2. Support for OpenCL is built in to NVIDIA's CUDA toolkit, though after an initial flurry of development activity up to version 1.2, development stalled until version 3.0. Support for OpenCL with AMD is part of the ROCM suite. Intel strongly supports OpenCL development for CPU's and GPU's with its oneAPI toolkit. The CPU implementation in oneAPI also works for AMD CPU's, which is really good! ARM has solid support for OpenCL on its Mali GPU's. The open source POCL (Portable OpenCL) implementation has a CPU implementation as well as support for OpenCL on CUDA and OpenCL on MacOS.

Conformance¶

A conformant OpenCL implementation is an implementation of OpenCL that has passed Khronos' test suite. The number of vendors with conformant implementations is an evolving list, click here to see the latest conformant implementations.

Documentation for OpenCL¶

The best source of help for OpenCL is Khronos OpenCL registry. There you can find excellent documentation on the latest specification that your vendor supports. Let's download the latest API specification in PDF format and have it ready as reference material.

Exercise:¶

Download from the Khronos OpenCL registry the latest OpenCL API and C language specifications to your computer.

OpenCL software compilation process¶

Just to avoid confusion, there are two compilation steps for OpenCL applications:

  1. Compiling the application itself before execution
  2. Compiling kernels from within an application during execution

During program execution, kernels are combined into programs and the programs are compiled for each compute device using the vendor's kernel compiler. Thankfully, when compiling an OpenCL application prior to execution (Step 1), we don't need to link against every available implementation. We just need to link against a single library file called the Installable Client Driver (ICD) Loader that may be provided by any vendor. The ICD loader has the name (opencl.dll) on Windows and (libOpenCL.so) on Linux. Accompanying the ICD loader are header files (opencl.h for C and opencl.hpp for C++) that live in a directory called CL. This directory must be in the include path of an application at compilation. It is good practice to source an ICD loader (i.e libOpenCL.so) and OpenCL headers directly from the Khronos Github site, because then you will always have the latest up-to-date OpenCL API. Alternatively, you can use a ICD loader provided by the vendor, but it may not be up with the latest OpenCL standard.

In the source program include the OpenCL headers. When using the C API you can just include the file opencl.h like this:

#include <CL/opencl.h>

Or, if you like to use the C++ API you can include the C++ headers, like this:

#include <CL/opencl.hpp>

We will be using the C API for this course, both to maintain compatibility with C and because the documentation for the C API is much better! Compilation of an OpenCL source file on Linux usually takes this form:

<compiler> -I<path to CL directory> -L<path to ICD loader> <source file> -o <destination program> -lOpenCL

When a program launches, the ICD loader looks up a directory to find OpenCL implementations. On Linux this directory is usually

/etc/OpenCL/vendors

However you can change the place to look for implementations by setting the environment variable OCL_ICD_VENDORS to another directory. Inside the vendors directory are a number of *.icd files, one for each vendor implementation of OpenCL. The contents of those files are just the libary path for the vendor's OpenCL implementation. The ICD loader takes care of intercepting all OpenCL library calls and routing them to that of the appropriate vendor implementation, depending on the compute device being used. The routing process happens transparently to the user.

Exercise: compiling your first OpenCL application¶

At the location hello_devices.cpp is a complete OpenCL application to obtain the size of on-device memory and the maximum Buffer size that is possible within that memory.

  • Step 1. From the Jupyter launcher start a Terminal and use cd to navigate to the src/L1_Introduction directory in the course material
cd src/L1_Introduction
  • Step 2. You need to know where the OpenCL ICD loader and OpenCL header files are located. On a Linux distribution the files could be available (but may not be) in the following locations.
File Directory
ICD loader (libOpenCL.so) /usr/lib/x86_64-linux-gnu
OpenCL C++ headers directory (CL) /usr/include

In the Terminal use ls to list the contents of these directories and locate the CL directory in which the OpenCL header files are located. On Linux you can add the location of the CL directory to your CPATH environment variable, and the location of libOpenCL.so to both your LIBRARY and LD_LIBRARY_PATH environment variables. Then you won't need to explicity tell the compiler where the OpenCL resources are. Compile the application source file hello_devices.cpp using the g++ compiler. The compilation command should look like this:

g++ -g -O2 hello_devices.cpp -o hello_devices.exe -lOpenCL
  • Step 4. Now run the application
./hello_devices.exe

You should see at least one device printed with the name and memory sizes. Now that you know how to let the compiler know about OpenCL you can use the make command within that directory to compile the example.

In [3]:
!make clean; make; ./hello_devices.exe
rm -rf *.exe
CC -g -fopenmp -O2 -I../include hello_devices.cpp -o hello_devices.exe -lOpenCL
Platform 0
	         Device index: 0
	                 name: gfx1035 
	   global memory size: 536 MB
Platform 1
	         Device index: 1
	                 name: NVIDIA GeForce RTX 3060 Laptop GPU 
	   global memory size: 6226 MB
Platform 2
	         Device index: 2
	                 name: AMD Ryzen 7 6800H with Radeon Graphics          
	   global memory size: 32865 MB
Platform 3
	         Device index: 3
	                 name: Intel(R) FPGA Emulation Device 
	   global memory size: 32865 MB

This application is rather rudimentary, however there is a far more sophisticated OpenCL query application called clinfo. You can use it to query a great deal on information on the available devices. Here we use clinfo to query available platforms and devices.

In [2]:
!clinfo -l
Platform #0: AMD Accelerated Parallel Processing
 `-- Device #0: gfx1035
Platform #1: NVIDIA CUDA
 `-- Device #0: NVIDIA GeForce RTX 3060 Laptop GPU
Platform #2: Intel(R) OpenCL
 `-- Device #0: AMD Ryzen 7 6800H with Radeon Graphics         
Platform #3: Intel(R) FPGA Emulation Platform for OpenCL(TM)
 `-- Device #0: Intel(R) FPGA Emulation Device

Resources¶

Written by Dr. Toby Potter of Pelagos Consulting and Education for the Pawsey Supercomputing Research Centre. All trademarks mentioned are the property of their respective owners.